Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

APK Parser (#1953) #1956

Merged
merged 23 commits into from
Nov 16, 2023
Merged

APK Parser (#1953) #1956

merged 23 commits into from
Nov 16, 2023

Conversation

wladimirleite
Copy link
Member

Closes #1953.
I managed to overcome the main limitations of the library used to parse the APKs.
There are a couple of TODO's related to certificates dates metadata that can be solved during review.

Processing 1K APKs (~12 GB) now took about 1 minute, while it was taking ~25 minutes before (with the default PackageParser). Index had 2.5 GB before, and now 20 MB.

A sample output of the parser:
image

@lfcnassif
Copy link
Member

Thank you very much @tc-wleite!

@wladimirleite
Copy link
Member Author

wladimirleite commented Nov 13, 2023

I ran a performance test with the 4 UDFRs which were extracted from Android phones used in #1912 tests, comparing total processing time between master branch and this PR (which includes an APK parser):

# Type Size (GB) Total Items Master Processing Time (s) Processing Time (s) using the APK Parser Processing Time Reduction
1 UFDR 2.4 57,495 555 500 10%
2 UFDR 18 118,297 1,494 649 57%
5 UFDR 33 697,193 3,553 3,008 15%
6 UFDR 37 1,596,925 4,342 3,546 18%

So, using the APK parser would reduce a bit the processing time of Android UFDRs.

@patrickdalla
Copy link
Collaborator

patrickdalla commented Nov 13, 2023 via email

@wladimirleite
Copy link
Member Author

wladimirleite commented Nov 13, 2023

I will review tomorrow your code.
Could you share your test cases with me?

Sure!
I just used APKs exported from some Android UFDRs I am working with (and for some test the UFDRs themselves).
I will export some of these APKs and send them to you through Teams our internal WebDAV server.

@wladimirleite
Copy link
Member Author

I will export some of these APKs and send them to you through Teams our internal WebDAV server.

I just copied a 7-Zip (~14 GB) with some samples to \Temp\Wladimir.

@lfcnassif
Copy link
Member

could you share your test cases with me?

I'll start a crawl in BSB cases database through the night.

@lfcnassif
Copy link
Member

could you share your test cases with me?

@patrickdalla I just shared 37K APKs from 87 cases with you.

@patrickdalla
Copy link
Collaborator

patrickdalla commented Nov 14, 2023

Hi @tc-wleite , IPED already have an certificate parser. I extracted the certificate data and enabled CertificateParser in ParsersConfig.xml (not enabled by default) and resulted in something like the image below. This could be a reasonable approach?

@lfcnassif I don't remember well why CertificateParser is not enabled by default. What I remember is that the certificate carver carves a lot of certificates, many of them duplicated many times. Can we reenable it?

image

@wladimirleite
Copy link
Member Author

wladimirleite commented Nov 14, 2023

Hi @tc-wleite , IPED already have an certificate parser. I extracted the certificate data and enabled CertificateParser in ParsersConfig.xml (not enabled by default) and resulted in something like the image below. This could be a reasonable approach?

Hi @patrickdalla!
Do you mean extract certificate data as subitems of APK? It seems a good idea. In such case, metadata set in the APK could be removed, as it would be set for its children.

Just a couple of minor details:

  • I think that setting the type of the properties "certificate:notafter" and "certificate:notbefore" as dates would avoid the extra properties with ":date" suffix.
  • Most property names use camel case, so perhaps "notAfter" and "notBefore" would be more readable.

@lfcnassif
Copy link
Member

lfcnassif commented Nov 14, 2023

@lfcnassif I don't remember well why CertificateParser is not enabled by default. What I remember is that the certificate carver carves a lot of certificates, many of them duplicated many times. Can we reenable it?

I don't remember either, maybe it caused a performance impact, and since most cases don't need certificate parsing information... Or maybe the impact was caused by the Certificate carver, since it recovers a lot of items. One option would be enabling the parser and keeping the carver disabled by default. By the way, Tika has a Pkcs7Parser, also disabled by default intentionally a long ago, which parses a subset of mimes handled by your parser, and ideally a comparison between them would be good, not sure if it was done.

PS: To enable CertificateParser, we should create an issue to report it in the 4.2.0 release notes.

@patrickdalla
Copy link
Collaborator

I have just reviewed Pkcs7Parser code from tika.

Pkcs7 is a container spec to hold content and its signature info in same file/stream. Pkcs7Parser of tika only strips/ignores the signature and delegate the content parsing to the corresponding parser. Pkcs7Parser doesn't parse any signature and respectives certification information.

Pkcs7 is most used to save certification revogation list and certification files itself (when included with entire certificates of certification path). The CertificateParser uses java.security.cert.CertificateFactory that can extract the certificates these files PKCS7 formatted content.

PKCS7 is not the format of the certificate used to sign the APK.

It seems from https://issues.apache.org/jira/browse/TIKA-3205, code done after the implementation of CertificateParser, that TIKA didn't classified PEM and DER files as "x-x509-ca-cert". But now it do.

I have created in CertificateParser "application/x-pem-file" and "application/pkix-cert" mime-types to identify this kind of content, but now it seems it can use the new "application/x-x509-ca-cert" identified by Tika.

I will create an issue with this same info.

@patrickdalla
Copy link
Collaborator

Anyway, I will finish reviewing this PR. If CategoriesToExpand.txt is configured to expand "Android Applications" category, it will create a new subitem with the certificate data. Tika is classifying this new subitem as application/x-x509-cert. When CertificateParser is reviewed later, it will parse this accordingly.

@lfcnassif
Copy link
Member

I have just reviewed Pkcs7Parser code from tika.

Pkcs7 is a container spec to hold content and its signature info in same file/stream. Pkcs7Parser of tika only strips/ignores the signature and delegate the content parsing to the corresponding parser. Pkcs7Parser doesn't parse any signature and respectives certification information.

Pkcs7 is most used to save certification revogation list and certification files itself (when included with entire certificates of certification path). The CertificateParser uses java.security.cert.CertificateFactory that can extract the certificates these files PKCS7 formatted content.

PKCS7 is not the format of the certificate used to sign the APK.

It seems from https://issues.apache.org/jira/browse/TIKA-3205, code done after the implementation of CertificateParser, that TIKA didn't classified PEM and DER files as "x-x509-ca-cert". But now it do.

I have created in CertificateParser "application/x-pem-file" and "application/pkix-cert" mime-types to identify this kind of content, but now it seems it can use the new "application/x-x509-ca-cert" identified by Tika.

I will create an issue with this same info.

Thank you @patrickdalla for your analysis!

Copy link
Collaborator

@patrickdalla patrickdalla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After internationalization code and the creation of specific category for APK, it is good.

@lfcnassif
Copy link
Member

Thanks @patrickdalla for your review. @tc-wleite is @patrickdalla approach fine to you?

@wladimirleite
Copy link
Member Author

Thanks @patrickdalla for your review. @tc-wleite is @patrickdalla approach fine to you?

Sure, it seems fine! I can run a last test before merging, but I will be able to do that in the afternoon.

@lfcnassif
Copy link
Member

Sure, it seems fine! I can run a last test before merging, but I will be able to do that in the afternoon.

That would be great! But please enjoy the holiday, this is not urgent.

@wladimirleite
Copy link
Member Author

Hi @lfcnassif!
I ran a large test and everything seems fine.
Just made a few minor (mostly visual) adjustments.

@lfcnassif
Copy link
Member

Thank you @tc-wleite for your adjustments and for testing again! Before merging, I would like to do 2 comments:

@patrickdalla
Copy link
Collaborator

patrickdalla commented Nov 16, 2023 via email

@lfcnassif
Copy link
Member

Thanks @patrickdalla. I also removed the changes in CertificateParser, since most of them are included in #1981 to avoid possible merge conflicts.

@lfcnassif lfcnassif merged commit 28152cd into master Nov 16, 2023
2 checks passed
@lfcnassif lfcnassif deleted the #1953_APKParser branch November 16, 2023 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

APK Parser
3 participants